Text Retrieval using Linear Algebra
Abstract
Text retrieval is an important area of research. As information and the methods of storing it have proliferated, so has the need for efficient ways of locating subsets of that information. A widely researched text-searching method models a text collection as a term-by-document matrix and evaluates each document’s relevance to a query with simple linear algebra. This document is an abstract of research performed throughout 2001 and 2002, and presents one such system, developed to search the bsu-cs webserver.

In this method, a matrix $A$ is constructed such that the frequency of term $i$ in document $j$ is stored at $A_{i,j}$. To build this matrix for bsu-cs, a simple index-creator was programmed. Beginning with a list of known Web pages, the script analyzed each document it encountered. Word frequencies were recorded, and links to other bsu-cs Web pages were extracted. Those links were both added to the page list and tabulated as possible indicators of document importance (more links to page x implies that page x is more important). Common words such as “the” and “how” have little intrinsic meaning [1], so a list of “stop words” was created, and the matrix-forming code excluded the members of this list. After several hours of parsing, the index-creator, in conjunction with a script to organize each page’s raw data, produced a sparse 26257-by-3557 matrix.

If a user supplies a list of key words, the relevance of those words to each document in the term-by-document matrix $A = [\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_n]$ can be determined. A common measure of this relevance is the cosine of the angle between the query vector $\vec{q}$ and each document (column) vector $\vec{a}_j$ [3]. If the query is represented by a vector $\vec{q}$ and the term-by-document matrix by $A$, the cosine of the angle between $\vec{q}$ and a document $\vec{a}_j$ is found by:
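The usual form of this measure is $\cos\theta_j = \vec{a}_j^T \vec{q} \,/\, (\|\vec{a}_j\|_2 \, \|\vec{q}\|_2)$. The following is a minimal sketch of the pipeline described above, not the authors' code: it builds a small dense term-by-document matrix, drops stop words, and scores each document against a query by cosine. The stop-word list, the three sample documents, and the function names `tokenize`, `build_matrix`, and `cosine_scores` are all assumptions made for illustration; the real system crawled bsu-cs pages and stored a much larger sparse matrix.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the original list is not given in the abstract.
STOP_WORDS = {"the", "how", "a", "of", "to", "is"}

def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def build_matrix(docs):
    """Return (terms, A), where A[i][j] is the frequency of term i in document j."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    A = [[c[t] for c in counts] for t in terms]
    return terms, A

def cosine_scores(terms, A, query):
    """Cosine of the angle between the query vector and each column of A."""
    q_counts = Counter(tokenize(query))
    q = [q_counts[t] for t in terms]
    q_norm = math.sqrt(sum(x * x for x in q))
    scores = []
    for j in range(len(A[0])):
        col = [row[j] for row in A]
        dot = sum(a * b for a, b in zip(col, q))
        col_norm = math.sqrt(sum(a * a for a in col))
        denom = col_norm * q_norm
        # A zero vector has no direction; treat its score as 0.
        scores.append(dot / denom if denom else 0.0)
    return scores

# Made-up documents standing in for crawled Web pages.
docs = [
    "How to search the web server",
    "Linear algebra and the cosine of an angle",
    "Search a matrix of terms",
]
terms, A = build_matrix(docs)
scores = cosine_scores(terms, A, "search matrix")
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

A dense list of lists keeps the sketch readable; at the scale reported in the abstract, a sparse representation would be the more practical choice.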
Publication date: 2003